Abstract

We consider semi-supervised classification when part of the available data is unlabeled.
These unlabeled data can be useful for the classification problem when we make an assumption
relating the behavior of the regression function to that of the marginal distribution. Seeger [18]
proposed the well-known cluster assumption as a reasonable one. We propose a mathematical
formulation of this assumption and a method based on density level set estimation that takes
advantage of it to achieve fast rates of convergence both in the number of unlabeled examples
and the number of labeled examples.
1 Introduction

Semi-supervised classification has been of growing interest over the past few years and many methods
have been proposed. These methods try to answer the question: "How can we improve classification
accuracy using unlabeled data together with the labeled data?" Unlabeled data can be used
in different ways depending on the assumptions on the model. There are two types of assumptions.
The first one consists in assuming that we have a set of potential classifiers and we want to aggregate
them. In that case, unlabeled data is used to measure the compatibility between the classifiers and
reduces the complexity of the resulting classifier (see, e.g., [3, ?]). The second approach is the one
that we use here. It assumes that the data contains clusters that have homogeneous labels and the
unlabeled observations are used to identify these clusters. This is the so-called cluster assumption.
This idea can be put in practice in several ways giving rise to various methods. The simplest is the
one presented here: estimate the clusters, then label each cluster uniformly. Most of these methods
use Hartigan's [?] definition of clusters, namely the connected components of the density level sets.
However, they use a parametric (usually mixture) model to estimate the underlying density which
can be far from reality. Moreover, no generalization error bounds are available for such methods. In
the same spirit, [20] and [17] propose methods that learn a distance using unlabeled data in order to
have intra-cluster distances smaller than inter-cluster distances. The whole family of graph-based
methods also aims at using unlabeled data to learn the distances between points. The edges of the
graphs reflect the proximity between points. For a detailed survey on graph methods we refer to [23].
Finally, we mention kernel methods, where unlabeled data are used to build the kernel. Recalling
that the kernel measures proximity between points, such methods can also be viewed as learning a
distance using unlabeled data (see [2], [7], [?]).

The cluster assumption can be interpreted in another way, i.e., as the requirement that the
decision boundary has to lie in low density regions. This interpretation has been widely used in
learning since it can be used in the design of standard algorithms such as Boosting [?, ?] or SVM
[?, ?], which are closely related to the kernel methods mentioned above. In these algorithms, a greater
penalization is given to decision boundaries that cross a cluster. For more details, see, e.g., [?], [?],
[5]. Although most methods make, sometimes implicitly, the cluster assumption, no formulation in
probabilistic terms has been provided so far. The formulation that we propose in this paper remains
very close to its original text formulation and allows us to derive generalization error bounds. We
also discuss what can and cannot be done using unlabeled data. One of the conclusions is that
considering the whole excess-risk is too ambitious and we need to concentrate on a smaller part of
it to observe the improvement of semi-supervised classification over standard classification.

Outline of the paper. After describing the model, we formulate the cluster assumption and discuss
why and how it can improve classification performance in the next section. In Section 3 we study
the population case when the marginal density $p$ is known, to get an idea of our target. Indeed,
such a population case corresponds in some way to the case when the amount of unlabeled data is
infinite. Section 4 contains the main result: we propose an algorithm for which we derive rates of
convergence for the $\lambda$-thresholded excess-risk as a measure of performance. An example of consistent
density level set estimators is given in Section 5. Section 6 is devoted to discussion on the choice of
$\lambda$ and possible improvements. Proofs of the results are gathered in Section 7.

Notation. Throughout the paper, we denote by $c_j$ positive constants. We write $\Gamma^c$ for the
complement of the set $\Gamma$. For two sequences $(u_p)_p$ and $(v_p)_p$ (in this paper, $p$ will be $m$ or $n$), we
write $u_p = O(v_p)$ if there exists a constant $C > 0$ such that $u_p \le C v_p$ and we write $u_p = \tilde{O}(v_p)$ if
$u_p \le C (\log p)^{\alpha} v_p$ for some constants $\alpha > 0$, $C > 0$. Thus, if $u_p = \tilde{O}(v_p)$, we have $u_p = o(v_p p^{\beta})$ for
any $\beta > 0$.

2 The model 

Let $(X, Y)$ be a random couple with joint distribution $P$, where $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of $d$
features and $Y \in \{0, 1\}$ is a label indicating the class to which $X$ belongs. The distribution $P$ of
the random couple $(X, Y)$ is completely determined by the pair $(P_X, \eta)$ where $P_X$ is the marginal
distribution of $X$ and $\eta$ is the regression function of $Y$ on $X$, i.e., $\eta(x) = P(Y = 1 \mid X = x)$. The goal
of classification is to predict the label $Y$ given the value of $X$, i.e., to construct a measurable function
$g : \mathcal{X} \to \{0, 1\}$ called a classifier. The performance of $g$ is measured by the average classification
error

$$R(g) \triangleq P(g(X) \ne Y).$$

A minimizer of the risk $R(g)$ over all classifiers is given by the Bayes classifier $g^*(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$,
where $\mathbf{1}_{\{\cdot\}}$ denotes the indicator function. Assume that we have a sample of $n$ observations
$(X_1, Y_1), \dots, (X_n, Y_n)$ that are independent copies of $(X, Y)$. An empirical classifier is a random
function $\hat{g}_n : \mathcal{X} \to \{0, 1\}$ constructed on the basis of the sample $(X_1, Y_1), \dots, (X_n, Y_n)$. Since $g^*$ is
the best possible classifier, we measure the performance of an empirical classifier $\hat{g}_n$ by its excess-risk

$$\mathcal{E}(\hat{g}_n) = \mathbb{E}_n R(\hat{g}_n) - R(g^*),$$

where $\mathbb{E}_n$ denotes the expectation with respect to the joint distribution of $(X_1, Y_1), \dots, (X_n, Y_n)$.
We denote hereafter by $\mathbb{P}_n$ the corresponding probability.



In many applications, a large amount of unlabeled data is available as well as a small set of
labeled data $(X_1, Y_1), \dots, (X_n, Y_n)$ and the goal of semi-supervised classification is to use the
unlabeled data to improve the performance of classifiers. Thus, we observe two independent samples
$\mathcal{X}_l = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ and $\mathcal{X}_u = \{X_{n+1}, \dots, X_{n+m}\}$, where $n$ is rather small and typically
$m \gg n$. It is well known that in order to make use of the additional unlabeled observations, we have
to make an assumption on the dependence between the marginal distribution of $X$ and the joint
distribution of $(X, Y)$. Seeger [18] formulated the rather intuitive cluster assumption as follows¹:

Two points $x, x' \in \mathcal{X}$ should have the same label $y$ if there is a path between them which
passes only through regions of relatively high $P_X$.

This assumption, in its raw formulation, cannot be exploited in the probabilistic model since (i) the
labels are random variables $Y, Y'$ so that the expression "should have the same label" is meaningless
unless $\eta$ takes values in $\{0, 1\}$ and (ii) it is not clear what "regions of relatively high $P_X$" are. To
match the probabilistic framework, we propose the following modifications.

(i) $P[Y = Y' \mid X, X' \text{ connected}] \ge P[Y \ne Y' \mid X, X' \text{ connected}]$, where "connected" means that there
is a path between $X$ and $X'$ which passes only through regions of relatively high $P_X$.

(ii) Define "regions of relatively high $P_X$" in terms of density level sets.

We now need to make the term relatively high density precise. Assume that $P_X$ admits a density $p$ with
respect to the Lebesgue measure on $\mathbb{R}^d$ denoted hereafter by $\mathrm{Leb}_d$. For a fixed $\lambda > 0$, the $\lambda$-level
set of the density $p$ is defined by

$$\Gamma(\lambda) \triangleq \{x \in \mathcal{X} : p(x) \ge \lambda\}. \tag{2.1}$$

We are now in position to give a precise definition of the cluster assumption. 

Cluster Assumption CA($\lambda$): Fix $\lambda > 0$ and assume that the density level set $\Gamma = \Gamma(\lambda)$ has
a countable number of connected components $T_j = T_j(\lambda)$, $j = 1, 2, \dots$. Then the function
$x \in \mathcal{X} \mapsto \mathbf{1}_{\{\eta(x) \ge 1/2\}}$ takes a constant value on each of the $T_j$, $j = 1, 2, \dots$.

Note that density level sets have the monotonicity property: $\lambda \ge \lambda'$ implies $\Gamma(\lambda) \subset \Gamma(\lambda')$.
In terms of the cluster assumption, it means that when $\lambda$ decreases to 0, the assumption CA($\lambda$)
becomes more restrictive. As a result, the parameter $\lambda$ can be considered as a level of confidence
characterizing to which extent the cluster assumption is valid for the distribution $P$ and its choice
is left to the user. For more details on the choice of $\lambda$, see Section 6.

A question remains: what happens outside of the set $\Gamma(\lambda)$? Assume that we are in the problematic
case, $P_X(\Gamma^c) = C > 0$, so that the question makes sense. Since the cluster assumption says
nothing about what happens outside of the set $\Gamma$, we can only perform supervised classification on
$\Gamma^c$. Consider now a classifier $\hat{g}_{n,m}$ built from labeled and unlabeled samples $(\mathcal{X}_l, \mathcal{X}_u)$ pooled together.
The excess-risk of $\hat{g}_{n,m}$ can be written (see [10])

$$\mathcal{E}(\hat{g}_{n,m}) = \mathbb{E}_{n,m} \int_{\mathcal{X}} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_{n,m}(x) \ne g^*(x)\}} \, p(x) \, dx,$$

where $\mathbb{E}_{n,m}$ denotes the expectation with respect to the pooled sample $(\mathcal{X}_l, \mathcal{X}_u)$. We denote hereafter
by $\mathbb{P}_{n,m}$ the corresponding probability. Since the unlabeled sample is of no help to classify points
in $\Gamma^c$, any reasonable classifier should be based on the sample $\mathcal{X}_l$ so that $\hat{g}_{n,m}(x) = \hat{g}_n(x)$, $\forall x \in \Gamma^c$,
and, keeping only the part of the integral over $\Gamma^c$, we have

$$\mathcal{E}(\hat{g}_{n,m}) \ge \mathbb{E}_n \int_{\Gamma^c} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_n(x) \ne g^*(x)\}} \, p(x) \, dx. \tag{2.2}$$

¹ The notation is adapted to the present framework.



Since we assumed $P_X(\Gamma^c) = C > 0$, the RHS of (2.2) is bounded from below by the optimal
rates of convergence that appear in supervised classification. These rates are typically of the order
$n^{-\alpha}$, $1/2 \le \alpha \le 1$ (see e.g. [?], [22], [2] and [?] for a comprehensive survey). Thus, unlabeled
data do not improve the rate of convergence of this part of the excess-risk. To observe the effect of
unlabeled data on the rates of convergence, we have to consider the $\lambda$-thresholded excess-risk of a
classifier $\hat{g}_{n,m}$ defined by

$$\mathcal{E}_\lambda(\hat{g}_{n,m}) \triangleq \mathbb{E}_{n,m} \int_{\Gamma(\lambda)} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_{n,m}(x) \ne g^*(x)\}} \, p(x) \, dx. \tag{2.3}$$

We will therefore focus on this measure of performance. Note that for such a measure, we only need
to consider classifiers $\hat{g}_{n,m}$ that are defined on $\Gamma$.

We now propose a method to obtain good upper bounds on this quantity, taking advantage of
the cluster assumption. The idea is to estimate the regions where the sign of $(\eta - 1/2)$ is constant
and make a majority vote on each region.



3 Results for known marginal distribution 

Consider the ideal situation where the density $p$ is known and we observe only the labeled sample
$\mathcal{X}_l = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$. Fix $\lambda > 0$ and assume that $\Gamma = \Gamma(\lambda)$ has a countable number of
connected components:

$$\Gamma = \bigcup_{j \ge 1} T_j,$$

where the $T_j = T_j(\lambda)$ are non-empty disjoint connected sets. Under the cluster assumption CA($\lambda$),
the function $x \mapsto \eta(x) - 1/2$ has constant sign on each $T_j$. Thus a simple and intuitive method for
classification is to perform a majority vote on each $T_j$.
For any $j \ge 1$, define $\delta_j = \delta_j(\lambda) \ge 0$, $\delta_j \le 1$ by

$$\delta_j = \int_{T_j} |2\eta(x) - 1| \, p(x) \, dx.$$

The following assumption characterizes how far $\eta$ is from 1/2 on every connected component $T_j$.

Global Margin Assumption GMA($\lambda$): There exists $\delta > 0$ such that, for any $j \ge 1$, either $\delta_j = 0$
or $\delta_j \ge \delta$.

Since $\sum_{j \ge 1} \delta_j \le 1$, a direct consequence of the GMA is that only a finite number of $\delta_j$ are positive.
The GMA assumption imposes that, on average over $T_j$, the regression function $\eta$ is away from 1/2
for any $j \ge 1$ such that $\delta_j > 0$. It describes the global behavior of $\eta$ on each connected component
$T_j$ as opposed to the standard margin assumption formulated in [?] and [22] which we will call
here local margin assumption (LMA). Assumption LMA characterizes the local behavior of $\eta$ in a
neighborhood of 1/2. In [2], it is stated as follows.

Local Margin Assumption LMA: There exist constants $C_0 > 0$ and $\alpha > 0$ such that

$$P_X\left(0 < |2\eta(X) - 1| \le t\right) \le C_0 t^{\alpha}, \quad \forall t > 0.$$

It is straightforward that when there is only a finite number of connected components $T_j$, $j = 1, \dots, J$
with non-zero Lebesgue measure, GMA is a consequence of LMA. However we will see in our analysis
that the rates of convergence depend crucially on the value of $\delta_j > 0$, $j = 1, 2, \dots$, while deriving
GMA from LMA yields a $\delta$ depending on $C_0$. For this reason, it is natural to introduce GMA instead
of using the well known but less flexible LMA.

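We now define our classifier based on the sample $\mathcal{X}_l$. For any $j \ge 1$, define the random variable

$$Z_n^j = \sum_{i=1}^{n} (2Y_i - 1) \, \mathbf{1}_{\{X_i \in T_j\}},$$

and denote by $\hat{g}_n^j$ the function $\hat{g}_n^j(x) = \mathbf{1}_{\{Z_n^j > 0\}}$ for all $x \in T_j$. Consider the classifier defined on $\Gamma$
by

$$\hat{g}_n(x) = \sum_{j \ge 1} \hat{g}_n^j(x) \, \mathbf{1}_{\{x \in T_j\}}, \quad x \in \Gamma.$$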
The following theorem gives exponential rates of convergence for the classifier $\hat{g}_n$ under CA($\lambda$).

Theorem 3.1 Fix $\lambda > 0$ and assume that CA($\lambda$) holds. Then, the classifier $\hat{g}_n$ satisfies

$$\mathcal{E}_\lambda(\hat{g}_n) \le 2 \sum_{j \ge 1} \delta_j e^{-n \delta_j^2 / 2}. \tag{3.1}$$

Moreover, if GMA($\lambda$) holds, inequality (3.1) reduces to

$$\mathcal{E}_\lambda(\hat{g}_n) \le 2 e^{-n \delta^2 / 2}. \tag{3.2}$$

A rapid overview of the proof shows that the rate of convergence $e^{-n \delta^2 / 2}$ cannot be improved
without further assumption. It will be our target in semi-supervised classification. However, we
need estimators of the connected components $T_j$, $j \ge 1$. In the next section we provide the main
result on semi-supervised learning, that is when the density $p$ is unknown but we can estimate it
using the unlabeled sample $\mathcal{X}_u$.
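To get a feel for the two bounds of Theorem 3.1, they can be evaluated numerically. The sketch below is illustrative only: the function names and the values of $\delta_j$ and $n$ are made up, and the code simply evaluates the right-hand sides of (3.1) and (3.2).

```python
import math

def known_density_bound(deltas, n):
    """Right-hand side of (3.1): 2 * sum_j delta_j * exp(-n * delta_j**2 / 2)."""
    return 2.0 * sum(d * math.exp(-n * d ** 2 / 2.0) for d in deltas)

def gma_bound(delta, n):
    """Right-hand side of (3.2): 2 * exp(-n * delta**2 / 2)."""
    return 2.0 * math.exp(-n * delta ** 2 / 2.0)

# Hypothetical margins: two informative components and one with delta_j = 0.
deltas = [0.3, 0.2, 0.0]
delta = min(d for d in deltas if d > 0)     # smallest positive margin, as in GMA

for n in (50, 100, 200):
    print(n, known_density_bound(deltas, n), gma_bound(delta, n))
```

Since the $\delta_j$ sum to at most 1 and each positive $\delta_j$ is at least $\delta$, the value of (3.1) never exceeds that of (3.2), and both decay exponentially in $n$.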

4 Main result 

We now deal with a more realistic case where the density $p$ is unknown and so are the density level
sets which have to be estimated using the unlabeled sample $\mathcal{X}_u = \{X_{n+1}, \dots, X_{n+m}\}$. Fix $\lambda > 0$ and
assume that $\Gamma = \Gamma(\lambda)$ has a countable number of connected components:

$$\Gamma = \bigcup_{j \ge 1} T_j,$$

where the $T_j = T_j(\lambda)$ are non-empty disjoint connected sets.

4.1 Density level set estimation 

Assume that the density $p$ is uniformly bounded by a constant $L(p)$ and that $\mathrm{Leb}_d(\mathcal{X}) < \infty$, where
$\mathrm{Leb}_d$ denotes the Lebesgue measure on $\mathbb{R}^d$. Denote by $\mathbb{P}_m$ and $\mathbb{E}_m$ respectively the probability and
the expectation w.r.t. the sample $\mathcal{X}_u$ of size $m$. Assume that for any $\lambda > 0$, we use the sample $\mathcal{X}_u$ to
construct an estimator $\hat{G}_m = \hat{G}_m(\lambda)$ of $\Gamma = \Gamma(\lambda)$ satisfying

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m \triangle \Gamma)\right] \to 0, \quad m \to +\infty. \tag{4.1}$$

We call such estimators consistent estimators of $\Gamma$. However, the connected components of a
consistent estimator of $\Gamma$ are not in general consistent estimators of the connected components of $\Gamma$. To
ensure componentwise consistency, we have to make assumptions on the connected components of $\Gamma$
and those of $\hat{G}_m$.

Let $B(x, r)$ be the $d$-dimensional closed ball of center $x \in \mathbb{R}^d$ and radius $r > 0$, defined by

$$B(x, r) \triangleq \{y \in \mathbb{R}^d : \|y - x\| \le r\},$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^d$.

Definition 4.1 Fix $r_0 \ge 0$ and $c_0 > 0$. We say that a set $C \subset \mathbb{R}^d$ is $r_0$-connected if for any
$x, x' \in C$, there exists a continuous map $f : [0, 1] \to C$ such that $f(0) = x$, $f(1) = x'$ and for any
$t \in [0, 1]$ and any $r \le r_0$, we have

$$\mathrm{Leb}_d\left(B(f(t), r) \cap C\right) \ge c_0 r^d.$$

A 0-connected set is simply called connected or pathwise connected.

This definition ensures that $\Gamma$ has no flat parts, which allows us to exclude pathological cases such as the
one presented on the left of Figure 1. Now, define the distance $d_\infty$ between two closed connected
sets $C_1$ and $C_2$ by

$$d_\infty(C_1, C_2) = \min_{x \in C_1, \, y \in C_2} \|x - y\|.$$

We say that a collection of connected sets $C_1, C_2, \dots$ is $s_0$-separated if $d_\infty(C_j, C_{j'}) \ge s_0$, $\forall j \ne j'$,
for some $s_0 > 0$. If the connected components of $\Gamma$ are not $s_0$-separated for some $s_0 > 0$, cases
such as the one presented on Figure 1 (right) could arise. In that case, two connected components
and therefore two clusters are identified which is obviously not desirable. Therefore, the cluster
assumption should not hold for that particular level $\lambda$ but it might hold for some $\lambda' \ne \lambda$.
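For sets represented by finite point clouds, the distance $d_\infty$ and the $s_0$-separation condition can be checked directly. The following Python sketch makes that assumption explicit: the helper names `set_distance` and `is_separated` are hypothetical, and a finite sample of each set only approximates the minimum over the full closed set.

```python
import numpy as np

def set_distance(c1, c2):
    """Smallest Euclidean distance between a point of c1 and a point of c2.

    c1 and c2 are arrays of shape (k, d): finite samples standing in for the sets.
    """
    c1 = np.atleast_2d(np.asarray(c1, dtype=float))
    c2 = np.atleast_2d(np.asarray(c2, dtype=float))
    diffs = c1[:, None, :] - c2[None, :, :]          # all pairwise differences
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min())

def is_separated(sets, s0):
    """True if every pair of sets in the collection is at distance at least s0."""
    return all(
        set_distance(sets[i], sets[j]) >= s0
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
    )

# Two small made-up point clouds in the plane.
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[4.0, 4.0], [1.0, 3.0]]
print(set_distance(c1, c2))                 # closest pair is (1, 0) and (1, 3)
print(is_separated([c1, c2], s0=2.5), is_separated([c1, c2], s0=3.5))
```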
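Note that the performance of a density level estimator $\hat{G}_m$ is measured by the quantity

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m \triangle \Gamma)\right] = \mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma)\right] + \mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c)\right]. \tag{4.2}$$

For some estimators, such as the penalized plug-in density level set estimators presented in Section 5,
we can prove that the dominant term in the RHS of (4.2) is $\mathbb{E}_m[\mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma)]$. This ensures that
with high probability the estimator $\hat{G}_m$ is included in $\Gamma$. We now give a precise definition of such
estimators.

Definition 4.2 Let $\hat{G}_m$ be an estimator of $\Gamma$ and fix $\alpha > 0$. We say that the estimator $\hat{G}_m$ is
consistent from inside at rate $m^{-\alpha}$ if it satisfies

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m \triangle \Gamma)\right] = \tilde{O}(m^{-\alpha})$$

and

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c)\right] = \tilde{O}(m^{-2\alpha}).$$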

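For fixed $\alpha > 0$, $\lambda > 0$, let $\hat{G}_m \subset \mathcal{X}$ be a consistent from inside estimator of $\Gamma$ at rate $m^{-\alpha}$. We
begin by clipping $\hat{G}_m$ in the following manner. Define the set

$$\mathrm{Clip}(\hat{G}_m) = \left\{x \in \hat{G}_m : \mathrm{Leb}_d\left(\hat{G}_m \cap B(x, (\log m)^{-1})\right) \le \frac{m^{-\alpha}}{(\log m)^{d}}\right\}.$$

Note that $\mathrm{Leb}_d(\mathcal{X}) < \infty$ yields

$$\mathrm{Leb}_d\left(\mathrm{Clip}(\hat{G}_m)\right) = \tilde{O}(m^{-\alpha})$$

and therefore the clipped set $\tilde{G}_m = \hat{G}_m \setminus \mathrm{Clip}(\hat{G}_m)$ is also consistent from inside at rate $m^{-\alpha}$. We
now use only $\tilde{G}_m$. It is straightforward that $\tilde{G}_m$ can be decomposed into a finite number $J_m$ of
connected components. We write for simplicity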



$$\tilde{G}_m = \bigcup_{i \ge 1} \tilde{T}_i, \tag{4.3}$$

where $\tilde{T}_i$ depends on $m$ and $\lambda$. Denote by $H_k$, $k = 1, 2, \dots$, the family of sets such that

$$\bigcup_{i \ge 1} \tilde{T}_i = \bigcup_{k \ge 1} H_k, \tag{4.4}$$

and $d_\infty(H_k, H_{k'}) \ge 2(\log m)^{-1}$, $\forall k \ne k'$. It is not hard to see that the sets $H_k$ are uniquely defined
from $\tilde{T}_1, \tilde{T}_2, \dots$. Let $J$ be a subset of $\mathbb{N}^* = \{1, 2, \dots\}$. Define $\kappa(j) \triangleq \{k : H_k \cap T_j \ne \emptyset\}$ and let
$D(J)$ be the event on which the sets $\kappa(j)$, $j \in J$, are reduced to singletons $\{k(j)\}$ which are disjoint,

$$D(J) = \left\{\kappa(j) = \{k(j)\}, \ k(j) \ne k(j'), \ \forall j, j' \in J, \ j \ne j'\right\}$$
$$\phantom{D(J)} = \left\{\kappa(j) = \{k(j)\}, \ (T_j \cup H_{k(j)}) \cap (T_{j'} \cup H_{k(j')}) = \emptyset, \ \forall j, j' \in J, \ j \ne j'\right\}. \tag{4.5}$$

In other words, on the event $D(J)$, there is a one-to-one correspondence between the collection
$\{T_j\}_{j \in J}$ and the collection $\{\{H_k\}_{k \in \kappa(j)}\}_{j \in J}$. Componentwise convergence of $\tilde{G}_m$ to $\Gamma$ is ensured
when $D(\mathbb{N}^*)$ has asymptotically overwhelming probability. The following proposition gives an upper
bound on the probability of the complement of $D(J)$ under certain conditions including the
finiteness of $J$.

Proposition 4.1 Fix $r_0 > 0$, $s_0 > 0$ and let $J$ be a subset of $\{1, 2, \dots\}$. Assume that $\{T_j\}_{j \in J}$ is
an $s_0$-separated collection of $r_0$-connected sets. Then, if $\hat{G}_m$ is an estimator of $\Gamma$ that is consistent
from inside at rate $m^{-\alpha}$, we have

$$\mathbb{P}_m\left(D^c(J)\right) = \tilde{O}(m^{-\alpha}).$$

The $r_0$-connectedness of all $T_j$, $j \in J$, and $\mathrm{Leb}_d(\mathcal{X}) < \infty$ entail that $J$ is necessarily finite.
Nevertheless, the number of connected components of $\Gamma$ can be infinite as long as there is only a finite
number of them for which $\delta_j = \int_{T_j} |2\eta - 1| \, dP_X > 0$.

To estimate the homogeneous regions, we will simply estimate the connected components of $\Gamma$.
In addition, when two connected components $\tilde{T}_j$ and $\tilde{T}_{j'}$ are close with respect to the distance $d_\infty$,
we merge² them into the same homogeneous region.

It yields the following pseudo-algorithm.

Pseudo-Algorithm

1. Use the unlabeled data $\mathcal{X}_u$ to construct an estimator $\hat{G}_m$ of $\Gamma$ that is consistent from
inside at rate $m^{-\alpha}$.

2. Define homogeneous regions as the unions of the connected components of $\tilde{G}_m = \hat{G}_m \setminus
\mathrm{Clip}(\hat{G}_m)$ that are closer than $2(\log m)^{-1}$ for the distance $d_\infty$, according to (4.3) and (4.4).

3. Assign a single label to each estimated homogeneous region by majority vote on labeled
data.
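The three steps can be illustrated with a short Python sketch in dimension one. It is a simplified illustration under several assumptions that are not taken from the paper: the density is estimated with a hand-written Gaussian kernel estimator on a grid, the clipping step is omitted, connected components are intervals read off the grid, and all names (`kde_on_grid`, `homogeneous_regions`, `fit_predict`) and numerical values are hypothetical. Points outside the estimated regions receive the rejection label 2.

```python
import numpy as np

def kde_on_grid(sample, grid, h):
    """Gaussian kernel density estimate of `sample`, evaluated at each grid point."""
    u = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2.0 * np.pi))

def runs_of_true(mask):
    """(start, stop) index pairs, stop exclusive, of the maximal runs of True."""
    padded = np.concatenate(([False], mask, [False]))
    change = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(change[::2], change[1::2]))

def homogeneous_regions(grid, mask, m):
    """Step 2: chain neighbouring intervals whose gap is below 2 / log(m)."""
    regions = []
    for a, b in runs_of_true(mask):
        lo, hi = grid[a], grid[b - 1]
        if regions and lo - regions[-1][-1][1] < 2.0 / np.log(m):
            regions[-1].append((lo, hi))      # merge into the previous region
        else:
            regions.append([(lo, hi)])        # start a new region
    return regions

def region_index(x, regions):
    """Index of the region containing each point of x, or -1 if none does."""
    x = np.asarray(x, dtype=float)
    idx = np.full(x.shape, -1, dtype=int)
    for k, region in enumerate(regions):
        for lo, hi in region:
            idx[(x >= lo) & (x <= hi)] = k
    return idx

def fit_predict(x_unlab, x_lab, y_lab, x_new, lam, penalty, h, grid, reject=2):
    x_unlab = np.asarray(x_unlab, dtype=float)
    p_hat = kde_on_grid(x_unlab, grid, h)                        # step 1
    regions = homogeneous_regions(grid, p_hat >= lam + penalty, len(x_unlab))
    new_idx = region_index(x_new, regions)
    if not regions:
        return np.full(new_idx.shape, reject)
    votes = np.zeros(len(regions))                               # step 3
    lab_idx = region_index(x_lab, regions)
    inside = lab_idx >= 0
    np.add.at(votes, lab_idx[inside], 2 * np.asarray(y_lab)[inside] - 1)
    region_label = (votes > 0).astype(int)
    return np.where(new_idx >= 0, region_label[np.maximum(new_idx, 0)], reject)

# Made-up data: two well-separated clusters and five labeled points.
grid = np.linspace(-6.0, 6.0, 1201)
x_unlab = np.concatenate([np.linspace(-4, -2, 500), np.linspace(2, 4, 500)])
x_lab = np.array([-3.2, -2.9, -3.0, 3.1, 2.8])
y_lab = np.array([0, 0, 1, 1, 1])
pred = fit_predict(x_unlab, x_lab, y_lab, x_new=[-3.5, 3.5, 0.0],
                   lam=0.1, penalty=0.02, h=0.2, grid=grid)
print(pred)
```

In this toy run the single label 1 in the left cluster is outvoted, so the whole left region is labeled 0, while the point at 0.0 lies in a low-density zone and is rejected.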



² Merging two sets means here replacing them by their union.





Figure 1: Set that is 0-connected but not $r_0$-connected for any $r_0 > 0$ (left) and non-separated
connected components (right).

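This method translates into two distinct error terms, one term in $m$ and another term in $n$. We
apply our three-step procedure to build a classifier $\hat{g}_{n,m}$ based on the pooled sample $(\mathcal{X}_l, \mathcal{X}_u)$. Fix
$\lambda > 0$, $\alpha > 0$ and let $\hat{G}_m$ be an estimator of the density level set $\Gamma = \{p \ge \lambda\}$ that is consistent
from inside with rate $m^{-\alpha}$. For any $k \ge 1$, define the random variable

$$Z_{n,m}^k = \sum_{i=1}^{n} (2Y_i - 1) \, \mathbf{1}_{\{X_i \in H_k\}},$$

where $H_k$ is defined in (4.4). Denote by $\hat{g}_{n,m}^k$ the function $\hat{g}_{n,m}^k(x) = \mathbf{1}_{\{Z_{n,m}^k > 0\}}$ for all $x \in H_k$ and
consider the classifier defined on $\mathcal{X}$ by

$$\hat{g}_{n,m}(x) = \sum_{k \ge 1} \hat{g}_{n,m}^k(x) \, \mathbf{1}_{\{x \in H_k\}}, \quad x \in \mathcal{X}. \tag{4.6}$$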



Note that the classifier $\hat{g}_{n,m}$ assigns the label 0 to any $x$ outside of $\tilde{G}_m$. This is a notational convention
and we can assign any value to $x$ on this set since we are only interested in the $\lambda$-thresholded
excess-risk. Nevertheless, it is more appropriate to assign a label referring to a rejection, e.g., the values
"2" or "R" (or any other value different from 0 and 1). The rejection means that this point should be
classified using labeled data only. However, when the amount of labeled data is too small, it might
be more reasonable not to classify this point at all. This modification is of particular interest in
the context of classification with a rejection option when the cost of rejection is smaller than the
cost of misclassification (see, e.g., [12]).

Theorem 4.1 Fix $\lambda > 0$, $\alpha > 0$, $r_0 > 0$ and assume that CA($\lambda$) holds. Consider an estimator $\hat{G}_m$
based on $\mathcal{X}_u$ that is consistent from inside with rate $m^{-\alpha}$. Then if the connected components of $\Gamma(\lambda)$
are $r_0$-connected and $s_0$-separated, the classifier $\hat{g}_{n,m}$ defined in (4.6) satisfies

$$\mathcal{E}_\lambda(\hat{g}_{n,m}) \le \tilde{O}\left(\frac{m^{-\alpha}}{1 - \theta}\right) + 2 \sum_{j \ge 1} \delta_j e^{-n (\theta \delta_j)^2 / 2} \tag{4.7}$$

for any $0 < \theta < 1$. Moreover, if GMA($\lambda$) holds, inequality (4.7) reduces to

$$\mathcal{E}_\lambda(\hat{g}_{n,m}) \le \tilde{O}\left(\frac{m^{-\alpha}}{1 - \theta}\right) + 2 e^{-n (\theta \delta)^2 / 2}. \tag{4.8}$$

Note that, since we often have $m \gg n$, the first term in the RHS of (4.7) and (4.8) can be considered
negligible so that we achieve an exponential rate of convergence in $n$ which is almost the same (up
to the constant $\theta$ in the exponent) as in the case where the density $p$ is known. The constant $\theta$
seems to be natural since it balances the two terms.

5 Plug-in rules for density level sets estimation 

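Fix $\lambda > 0$ and recall that our goal is to estimate the connected components $T_j = T_j(\lambda)$, $j = 1, 2, \dots$,
of $\Gamma = \Gamma(\lambda) = \{x \in \mathcal{X} : p(x) \ge \lambda\}$, using the unlabeled sample $\mathcal{X}_u$ of size $m$. A simple and intuitive
way to achieve this goal is to use plug-in estimators of $\Gamma$ defined by

$$\hat{\Gamma} = \hat{\Gamma}(\lambda) \triangleq \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda\},$$

where $\hat{p}_m$ is some estimator of $p$. A straightforward generalization is given by the penalized plug-in
estimators of $\Gamma(\lambda)$, defined by

$$\hat{\Gamma}_\ell = \hat{\Gamma}_\ell(\lambda) \triangleq \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell\},$$

where $\ell > 0$ is a penalization. Clearly $\hat{\Gamma}_\ell \subset \hat{\Gamma}$. Therefore the connected components of $\hat{\Gamma}_\ell$ are farther
from each other than those of $\hat{\Gamma}$. Keeping in mind that we want estimators that are consistent from
inside, we are going to consider a sufficiently large penalization $\ell = \ell(m)$.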

Plug-in rules have a practical advantage over direct methods such as empirical excess mass
maximization (see, e.g., [?], [?], [?]). Once we have an estimator $\hat{p}_m$, we can compute the whole
collection $\{\hat{\Gamma}_\ell(\lambda), \lambda > 0\}$, which might be of interest for the user who wants to try several values of $\lambda$.
Note also that a wide range of density estimators is available in usual software. A density estimator
can be parametric, typically based on a mixture model, or nonparametric such as histograms or
kernel density estimators.

Definition 5.1 For any $\lambda, \gamma \ge 0$, a function $f : \mathcal{X} \to \mathbb{R}$ is said to have $\gamma$-exponent at level $\lambda$ if
there exists a constant $c_0 > 0$ such that, for all $\varepsilon > 0$,

$$\mathrm{Leb}_d\{x \in \mathcal{X} : |f(x) - \lambda| \le \varepsilon\} \le c_0 \varepsilon^{\gamma}.$$

It is an analog of the local margin assumption but for arbitrary level $\lambda$ in place of 1/2. When $\gamma > 0$
it ensures that the function $f$ has no flat part at level $\lambda$.

The next theorem gives fast rates of convergence for penalized plug-in rules when $\hat{p}_m$ satisfies
an exponential inequality and $p$ has $\gamma$-exponent at level $\lambda$. Moreover, it ensures that when the
penalization $\ell$ is suitably chosen, the plug-in estimator is consistent from inside.

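Theorem 5.1 Fix $\lambda > 0$, $\gamma > 0$ and $\Delta > 0$. Let $\hat{p}_m$ be an estimator of the density $p$ such that
$P_X(\hat{p}_m(X) \ge \lambda) \le C$, $\mathbb{P}_m$-almost surely for some positive constant $C$ and let $\mathcal{P}$ be a class of
densities on $\mathcal{X}$. Assume that there exist positive constants $c_1$, $c_2$ and $a \le 1$, such that for $P_X$-almost
all $x \in \mathcal{X}$, we have

$$\sup_{p \in \mathcal{P}} \mathbb{P}_m\left(|\hat{p}_m(x) - p(x)| \ge \delta\right) \le c_1 e^{-c_2 m^{a} \delta^2}, \quad m^{-a/2} < \delta < \Delta. \tag{5.1}$$

Assume further that $p$ has $\gamma$-exponent at level $\lambda$ and that the penalty $\ell$ is chosen as

$$\ell = \ell(m) = m^{-a/2} \log m. \tag{5.2}$$

Then the plug-in estimator $\hat{\Gamma}_\ell$ is consistent from inside at rate $m^{-\gamma a / 2}$.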



Consider a kernel density estimator $\hat{p}_m^K$ based on the sample $\mathcal{X}_u$ defined by

$$\hat{p}_m^K(x) = \frac{1}{m h^d} \sum_{i=n+1}^{n+m} K\left(\frac{X_i - x}{h}\right),$$

where $h > 0$ is the bandwidth parameter and $K : \mathbb{R}^d \to \mathbb{R}$ is a kernel. If $p$ is assumed to have Hölder
smoothness parameter $\beta > 0$ and if $K$ and $h$ are suitably chosen, it is a standard exercise to prove
an inequality of type (5.1) with $a = 2\beta / (2\beta + d)$. In that case, it can be shown that the rate $m^{-\gamma a / 2}$ is
optimal in a minimax sense.
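The interplay between the exponent $a$, the penalty (5.2) and the two plug-in sets can be sketched numerically. The Python snippet below is only an illustration: the function names are hypothetical, the density estimates on the grid are made-up numbers rather than the output of an actual estimator, and the values $\beta = 2$, $d = 1$, $m = 10^6$ and $\lambda = 0.3$ are arbitrary.

```python
import math
import numpy as np

def holder_exponent(beta, d):
    """a = 2*beta / (2*beta + d), the exponent quoted above for Holder-beta densities."""
    return 2.0 * beta / (2.0 * beta + d)

def penalty(m, a):
    """Penalty l(m) = m**(-a/2) * log(m), as in (5.2)."""
    return m ** (-a / 2.0) * math.log(m)

def plug_in_masks(p_hat, lam, ell):
    """Grid indicators of the plain and the penalized plug-in level sets."""
    p_hat = np.asarray(p_hat, dtype=float)
    return p_hat >= lam, p_hat >= lam + ell

a = holder_exponent(beta=2.0, d=1)
ell = penalty(m=10 ** 6, a=a)                      # roughly 0.055
p_hat = [0.05, 0.20, 0.34, 0.50, 0.32, 0.10]       # made-up density estimates on a grid
plain, penalized = plug_in_masks(p_hat, lam=0.3, ell=ell)
print(a, ell)
print(plain, penalized)
```

The penalized set is always contained in the plain plug-in set; here the penalty removes the two grid points whose estimated density exceeds $\lambda$ only marginally, which is the mechanism behind consistency from inside.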

6 Discussion 

We proposed a formulation of the cluster assumption in probabilistic terms. This formulation relies
on Hartigan's [?] definition of clusters but it can be modified to match other definitions of clusters
in the following way.

Consider a collection of $r_0$-connected and $s_0$-separated sets (clusters) $T_j$, $j = 1, 2, \dots$.
Then the function $x \mapsto (\eta(x) - 1/2)$ has constant sign on each $T_j$.

We also proved that there is no hope to improve the classification performance outside of these
clusters. Based on these remarks, we defined the $\lambda$-thresholded excess-risk which can be easily
generalized to the setup of general clusters defined above. Finally we proved that when we have
consistent estimators of the clusters, it is possible to achieve exponential rates of convergence for the
$\lambda$-thresholded excess-risk. The theory developed here can be extended to any definition of clusters
as long as they can be consistently estimated.

Note that our definition of clusters is parametrized by $\lambda$ which is left to the user, depending on
his trust in the cluster assumption. The choice of $\lambda$ can be made by fixing $P_X(\Gamma^c)$, the probability
of the rejection region. We refer to [2] for more details. Note that data-driven choices of $\lambda$ could be
easily derived if we impose a condition on the purity of the clusters, i.e., if we are given the $\delta$ in the
global margin assumption. Such a choice could be made by decreasing $\lambda$ until the level of purity
is attained. However, any data-driven choice of $\lambda$ has to be made using the labeled data. It would
therefore yield much worse bounds.

General open problems are: applying the cluster assumption to other definitions of clusters and
studying the whole excess-risk in the framework of semi-supervised classification with a rejection option.

7 Appendix: proofs 

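7.1 Proof of Theorem 3.1

Using the decomposition of $\Gamma$ into its connected components, we can decompose $\mathcal{E}_\lambda(\hat{g}_n)$ into

$$\mathcal{E}_\lambda(\hat{g}_n) = \mathbb{E}_n \sum_{j \ge 1} \int_{T_j} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_n^j(x) \ne g^*(x)\}} \, p(x) \, dx.$$

Fix $j \in \{1, 2, \dots\}$ and assume w.l.o.g. that $\eta \ge 1/2$ on $T_j$. It yields $g^*(x) = 1$, $\forall x \in T_j$, and since
$\hat{g}_n^j$ is also constant on $T_j$, we get

$$\int_{T_j} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_n^j(x) \ne g^*(x)\}} \, p(x) \, dx = \mathbf{1}_{\{Z_n^j \le 0\}} \int_{T_j} (2\eta(x) - 1) \, p(x) \, dx = \delta_j \mathbf{1}_{\{Z_n^j \le 0\}}. \tag{7.1}$$

Taking expectation $\mathbb{E}_n$ on both sides of (7.1) we get

$$\mathbb{E}_n \int_{T_j} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_n^j(x) \ne g^*(x)\}} \, p(x) \, dx = \delta_j \mathbb{P}_n\left(Z_n^j \le 0\right) \le 2 \delta_j e^{-n \delta_j^2 / 2}, \tag{7.2}$$

where we used Hoeffding's inequality to get the last inequality. Summing now over $j$ yields the
theorem.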

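7.2 Proof of Proposition 4.1

Define $m_0 \triangleq \exp(1 / (r_0 \wedge s_0))$. Since the connected components $T_j$ are $r_0$-connected, there is only
a finite number $J \ge 1$ of them. We simply denote $D(J)$ by $D$. For any $j = 1, \dots, J$, the
$r_0$-connectedness of $T_j$ yields on the one hand,

$$A_1(j) \triangleq \{\mathrm{card}[\kappa(j)] = 0\} \subset \left\{\mathrm{Leb}_d[\tilde{G}_m \triangle \Gamma] \ge c_0 (\log m)^{-d}\right\},$$
$$A_2(j) \triangleq \{\mathrm{card}[\kappa(j)] \ge 2\} \subset \left\{\mathrm{Leb}_d[\tilde{G}_m \triangle \Gamma] \ge c_0 (\log m)^{-d}\right\}.$$

The previous inclusions are illustrated in Figure 2.

Figure 2: By construction, $H_1$ and $H_2$ are separated by a ball of radius $(\log m)^{-1}$, which is included
in $B(x, r_0)$ when $m \ge m_0$. So if $\{1, 2\} \subset \kappa(j)$ or $\kappa(j) = \emptyset$, this ball is included in $\tilde{G}_m \triangle \Gamma$.

On the other hand, $\kappa(j) \cap \kappa(j') \ne \emptyset$ for some $j' \ne j$ when either (i) $\exists\, l$ s.t. $\tilde{T}_l \cap T_j \ne \emptyset$, $\tilde{T}_l \cap T_{j'} \ne \emptyset$
or (ii) $\exists\, l \ne l'$ s.t. $\tilde{T}_l \cap T_j \ne \emptyset$, $\tilde{T}_{l'} \cap T_{j'} \ne \emptyset$ and $d_\infty(\tilde{T}_l, \tilde{T}_{l'}) < 2(\log m)^{-1}$. Both cases yield the
existence of $x \in \Gamma^c \cap \tilde{G}_m$ such that $B(x, (\log m)^{-1}) \subset \Gamma^c$ for $m \ge m_0$. Therefore

$$\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge \mathrm{Leb}_d\left(\tilde{G}_m \cap B(x, (\log m)^{-1})\right).$$

By construction of $\tilde{G}_m$, we have $\mathrm{Leb}_d(B(x, (\log m)^{-1}) \cap \tilde{G}_m) \ge m^{-\alpha} (\log m)^{-d}$. Hence

$$A_3(j) \triangleq \bigcup_{j' \ne j} \{\kappa(j) \cap \kappa(j') \ne \emptyset\} \subset \left\{\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge m^{-\alpha} (\log m)^{-d}\right\}.$$

Both cases are illustrated in Figure 3.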





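Figure 3: Case (i) (left) and case (ii) (right).

Now, since

$$D^c = \bigcup_{j=1}^{J} A_1(j) \cup A_2(j) \cup A_3(j),$$

we get

$$\mathbb{P}_m(D^c) \le \mathbb{P}_m\left\{\mathrm{Leb}_d[\tilde{G}_m \triangle \Gamma] \ge c_0 (\log m)^{-d}\right\} + \mathbb{P}_m\left\{\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge m^{-\alpha} (\log m)^{-d}\right\}.$$

Using the Markov inequality for both terms we obtain

$$\mathbb{P}_m\left\{\mathrm{Leb}_d[\tilde{G}_m \triangle \Gamma] \ge c_0 (\log m)^{-d}\right\} = \tilde{O}(m^{-\alpha})$$

and

$$\mathbb{P}_m\left\{\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge m^{-\alpha} (\log m)^{-d}\right\} = \tilde{O}(m^{-\alpha}),$$

where we used the fact that $\tilde{G}_m$ is consistent from inside with rate $m^{-\alpha}$. It yields the statement of
the proposition.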

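7.3 Proof of Theorem 4.1

The $\lambda$-thresholded excess-risk $\mathcal{E}_\lambda(\hat{g}_{n,m})$ can be decomposed w.r.t. the event $D$ and its complement.
It yields

$$\mathcal{E}_\lambda(\hat{g}_{n,m}) \le \mathbb{E}_m\left[\mathbf{1}_D \, \mathbb{E}_n \int_{\Gamma} |2\eta(x) - 1| \, \mathbf{1}_{\{\hat{g}_{n,m}(x) \ne g^*(x)\}} \, p(x) \, dx\right] + \mathbb{P}_m[D^c].$$

We now treat the first term of the RHS of the above inequality, i.e., on the event $D$. Fix $j \in \{1, 2, \dots\}$
and assume w.l.o.g. that $\eta \ge 1/2$ on $T_j$. Simply write $Z^k$ for $Z_{n,m}^k$. By definition of $D$, there is
a one-to-one correspondence between the collection $\{T_j\}_j$ and the collection $\{H_k\}_k$. We denote by
$H_j$ the unique element of $\{H_k\}_k$ such that $H_j \cap T_j \ne \emptyset$. On $D$, for any $j \ge 1$, we have

$$\mathbb{E}_n\left(\int_{T_j} |2\eta - 1| \, \mathbf{1}_{\{\hat{g}_{n,m} \ne g^*\}} \, dP_X \,\Big|\, \mathcal{X}_u\right) \le \int_{T_j \setminus \tilde{G}_m} (2\eta - 1) \, dP_X + \mathbb{E}_n\left(\mathbf{1}_{\{Z^j \le 0\}} \,\Big|\, \mathcal{X}_u\right) \int_{T_j} (2\eta - 1) \, dP_X$$
$$\le L(p) \, \mathrm{Leb}_d(T_j \setminus \tilde{G}_m) + \delta_j \mathbb{P}_n\left(Z^j \le 0 \mid \mathcal{X}_u\right).$$

On the event $D$, for any $0 < \theta < 1$, it holds

$$\mathbb{P}_n\left(Z^j \le 0 \mid \mathcal{X}_u\right) = \mathbb{P}_n\left(\int_{H_j} (2\eta - 1) \, dP_X - \frac{Z^j}{n} \ge \int_{H_j} (2\eta - 1) \, dP_X \,\Big|\, \mathcal{X}_u\right)$$
$$\le \mathbb{P}_n\left(\Big|\frac{Z^j}{n} - \int_{H_j} (2\eta - 1) \, dP_X\Big| \ge \theta \delta_j \,\Big|\, \mathcal{X}_u\right) + \mathbf{1}_{\{P_X[T_j \triangle H_j] \ge (1 - \theta) \delta_j\}}.$$

Using Hoeffding's inequality to control the first term, we get

$$\mathbb{P}_n\left(Z^j \le 0 \mid \mathcal{X}_u\right) \le 2 e^{-n (\theta \delta_j)^2 / 2} + \mathbf{1}_{\{P_X[T_j \triangle H_j] \ge (1 - \theta) \delta_j\}}.$$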

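Taking expectations, and summing over $j$, the $\lambda$-thresholded excess-risk is upper bounded by

$$\frac{2 L(p)}{1 - \theta} \, \mathbb{E}_m\left[\mathrm{Leb}_d(\Gamma \triangle \tilde{G}_m)\right] + 2 \sum_{j \ge 1} \delta_j e^{-n (\theta \delta_j)^2 / 2} + \mathbb{P}_m(D^c),$$

where we used the Markov inequality for the indicator terms and the fact that on $D$,

$$\sum_{j \ge 1} \mathrm{Leb}_d[T_j \triangle H_j] \le \mathrm{Leb}_d[\Gamma \triangle \tilde{G}_m].$$

From Proposition 4.1 we have $\mathbb{P}_m(D^c) = \tilde{O}(m^{-\alpha})$ and $\mathbb{E}_m \mathrm{Leb}_d(\Gamma \triangle \tilde{G}_m) = \tilde{O}(m^{-\alpha})$ and the
theorem is proved.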

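7.4 Proof of Theorem 5.1

Recall that

$$\hat{\Gamma}_\ell \triangle \Gamma = \left(\hat{\Gamma}_\ell \cap \Gamma^c\right) \cup \left(\hat{\Gamma}_\ell^c \cap \Gamma\right).$$

We begin with the first term. We have

$$\hat{\Gamma}_\ell \cap \Gamma^c = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell, \ p(x) < \lambda\} \subset \{x \in \mathcal{X} : |\hat{p}_m(x) - p(x)| \ge \ell\}.$$

The Fubini theorem yields

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{\Gamma}_\ell \cap \Gamma^c)\right] \le \mathrm{Leb}_d(\mathcal{X}) \sup_{x \in \mathcal{X}} \mathbb{P}_m\left[|\hat{p}_m(x) - p(x)| \ge \ell\right] \le c_3 e^{-c_2 m^{a} \ell^2},$$

where the last inequality is obtained using (5.1) and $c_3 = c_1 \mathrm{Leb}_d(\mathcal{X}) > 0$. Taking $\ell$ as in (5.2) yields
for $m \ge \exp(\gamma a / c_2)$,

$$\mathbb{E}_m\left[\mathrm{Leb}_d(\hat{\Gamma}_\ell \cap \Gamma^c)\right] \le c_3 m^{-\gamma a}. \tag{7.3}$$

We now prove that $\mathbb{E}_m[\mathrm{Leb}_d(\hat{\Gamma}_\ell^c \cap \Gamma)] = \tilde{O}(m^{-\gamma a / 2})$. Consider the following decomposition where
we drop the dependence in $x$ for notational convenience,

$$\hat{\Gamma}_\ell^c \cap \Gamma = B_1 \cup B_2,$$

where

$$B_1 \triangleq \{\hat{p}_m < \lambda + \ell, \ p \ge \lambda + 2\ell\} \subset \{|\hat{p}_m - p| \ge \ell\}$$

and

$$B_2 \triangleq \{\hat{p}_m < \lambda + \ell, \ \lambda \le p < \lambda + 2\ell\} \subset \{|p - \lambda| \le 2\ell\}.$$

Using (5.1) and (5.2) in the same fashion as above we get $\mathbb{E}_m[\mathrm{Leb}_d(B_1)] = \tilde{O}(m^{-\gamma a / 2})$. The term
corresponding to $B_2$ is controlled using the $\gamma$-exponent of density $p$ at level $\lambda$. Indeed, we have

$$\mathrm{Leb}_d(B_2) \le c_0 (2\ell)^{\gamma} = c_0 2^{\gamma} (\log m)^{\gamma} m^{-\gamma a / 2} = \tilde{O}(m^{-\gamma a / 2}).$$

The previous upper bounds for $B_1$ and $B_2$ together with (7.3) yield the consistency from inside.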